Packages
library(tidyr)
library(ggplot2)
library(dplyr)
library(ggcorrplot)
library(plotly)
library(sampling)
library(ggpubr)
library(manipulateWidget)This data set was retrieved from Kaggle. This dataset contains information on the relationship between personal attributes (age, gender, BMI, children dependents, smoking habits), geographic factors, and their impact on medical insurance charges.
It can be used to study how these attributes influence insurance costs.
health_ins<-read.csv('C:\\Deepa\\Boston University\\Assignments\\MET CS 544\\Final Project\\Final Project dataset\\insurance.csv')
head(health_ins) ## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
Convert Categorical Variables to Factors for analysis
health_ins$sex<-as.factor(health_ins$sex)
health_ins$smoker<-as.factor(health_ins$smoker)
health_ins$region<-as.factor(health_ins$region)
health_ins$children<-as.factor(health_ins$children)Summary statistics for the data set, providing information like mean, median, quartiles, and counts for each variable.
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 0:574 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1:324 yes: 274
## Median :39.00 Median :30.40 2:240
## Mean :39.21 Mean :30.66 3:157
## 3rd Qu.:51.00 3rd Qu.:34.69 4: 25
## Max. :64.00 Max. :53.13 5: 18
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
Data frame created by grouping the health_ins data frame by the ‘sex’ column calculating the count of each group
Analyzing distribution of insurance coverage among males and females in the data set, offering insights into the gender-based patterns of insurance enrollment.
Below plot provides a clear visual comparison of the number of insured individuals in each gender category. The number of females and males are almost equal, males being slightly higher than females in the data set
a<-ggplot(data=df, aes(x=sex, y=count, fill=sex)) +
geom_bar(stat="identity",width=0.3) +
scale_y_continuous(breaks=seq(0,700,200)) +
theme_minimal()+
scale_x_discrete(labels = c("Female", "Male")) +
theme(legend.position="right")+
ggtitle("Barplot of Frequency of Females and Males")+
theme(plot.title = element_text(hjust=0.5))+
labs(x = "Gender", y = "Count")
ggplotly(a)The histogram illustrates the distribution of individuals with insurance coverage across different age groups. Notably, the number of insured individuals in their 20s surpasses that of other age categories, with the count in the 60s being the lowest. This trend suggests that people in their 20s may demonstrate a greater propensity to obtain insurance, potentially influenced by factors such as heightened awareness of health and financial planning.
b<-ggplot(data=health_ins, aes(x=age)) +
geom_histogram(fill = "blue",bins=40, colour='white') +
labs(x ='Age', y='Count', title = 'Age wise Count') +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5))
ggplotly(b)Groups the ‘health_ins’ dataframe by the ‘sex’ and ‘smoker’ columns. Calculates the number of individuals for each combination of ‘sex’ and ‘smoker’ groups.
Bar plot depicts the percentage of male and female smokers in the data set. The analysis shows that proportion of smokers in male category are higher than the females
c<-ggplot(df1, aes(x = sex, y = count, fill = smoker)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Counts of Male and Female Smokers",
x = "Smoker",
y = "Percentage") +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5))+
scale_fill_manual(values = c("yes" = "pink", "no" = "blue")) +
scale_y_continuous(labels = scales::percent_format())
ggplotly(c)The below is the Box plot distribution of insurance charges in the ‘health_ins’ data set, stratified by gender (‘sex’) and further differentiated by smoking status (‘smoker’). It allows for a quick comparison of central tendencies, variability, and the presence of outlier in charges across these groups As observed from the box plot the insurance charges are expectedly higher for smokers as compared to non smokers. Also interesting median charges of male smokers is comparatively higher than female smokers There are outliers in non smokers that could be because of other factors like dependents or bmi or region
Correlation matrix shows the pairwise correlation coefficients between these variables
## age bmi charges
## age 1.0000000 0.1092719 0.2990082
## bmi 0.1092719 1.0000000 0.1983410
## charges 0.2990082 0.1983410 1.0000000
The correlation coefficient between ‘age’ and ‘bmi’ is approximately 0.109 indicating a weak positive correlation between them.As age increases, BMI tends to slightly increase.
The correlation coefficient between ‘bmi’ and ‘charges’ is approximately 0.198 indicating a weak positive correlation between them. As bmi increases, charges tends to slightly increase.
The correlation coefficient between ‘age’ and ‘charges’ is approximately 0.299.This indicates age and charges are moderately correlated and as age increases charges tend to increase.
The histogram reveals a predominance of individuals with insurance charges around $20,000, suggesting a right-skewed distribution. This implies that a greater proportion of people are paying lower insurance charges compared to those with higher costs. This aligns with common sense, as those with elevated insurance charges may be influenced by factors such as smoking, older age, and geographical location.
options(scipen=999)
x<-ggplot(data=health_ins, aes(x=charges)) +
geom_histogram(fill = "blue",bins=40, colour='white',aes(y=..density.. )) +
labs(x ='Charges', y='Density', title = 'Insurance charges') +
theme_minimal() + geom_density(linetype='dashed', lwd=1.2)+
theme(plot.title = element_text(hjust=0.5))
ggplotly(x)Distribution of charges across different regions, with a distinction between smokers and non-smokers. The median charges are highest in southeast for smokers followed by southwest. This shows that insurance charges in southern region is high as compared to northern regions. Interestingly, for non-smokers, the charges are in almost same range across regions
The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution. Using the charges attribute in this data set the applicability of the central limit theorem can be shown. There is a right-skewed distribution of insurance charges in the data set as displayed in the insurance charges histogram Below are histograms showing the sample means of 1000 random samples of sample sizes 10, 30, 60, and 80 that follow a normal distribution. As sample size increases the distribution tends to be more normally distributed.
Number of samples and different sample sizes
options(scipen=999)
sample<- 1000
sample.size1<- 10
sample.size2<- 30
sample.size3<- 60
sample.size4<- 80
#variable to store sample means for different sample sizes
xb1<- numeric(sample)
xb2<- numeric(sample)
xb3<- numeric(sample)
xb4<- numeric(sample)
set.seed(2698) # Set seed for reproducibility
# Generate samples and calculate means for each sample size
for(i in 1:sample){
xb1[i]<- mean(sample(health_ins$charges, sample.size1, replace = F))
xb2[i]<- mean(sample(health_ins$charges, sample.size2, replace = F))
xb3[i]<- mean(sample(health_ins$charges, sample.size3, replace = F))
xb4[i]<- mean(sample(health_ins$charges, sample.size4, replace = F))
}
# Create histograms for each sample size
d1<-data_frame(val = xb1) %>%
ggplot(., aes(val)) +
geom_histogram(fill="blue", aes(y=after_stat(density))) +xlab('Mean of charges')+ylab('density')+
ggtitle('Distribution of mean of charges N=10') +xlim(0, 30000) +
ylim(0, 0.00025)
d2<-data_frame(val = xb2) %>%
ggplot(., aes(val)) +
geom_histogram(fill="red", aes(y=after_stat(density))) +xlab('Mean of charges')+ylab('density')+
ggtitle('Distribution of mean of charges N=30') +xlim(0, 30000) +
ylim(0, 0.00025)
d3<-data_frame(val = xb3) %>%
ggplot(., aes(val)) +
geom_histogram(fill="green", aes(y=after_stat(density))) +xlab('Mean of charges')+ylab('density')+
ggtitle('Distribution of mean of charges N=60')+xlim(0, 30000)
d4<-data_frame(val = xb4) %>%
ggplot(., aes(val)) +
geom_histogram(fill="pink", aes(y=after_stat(density))) +xlab('Mean of charges')+ylab('density')+
ggtitle('Distribution of mean of charges N=80') +xlim(0, 30000) +ylim(0, 0.0003)
ggarrange(d1,d2,d3,d4,nrow=2,ncol=2) ## Population mean is: 13270.42
## Population standard devation is: 12110.01
## Mean of sample means of size N=10 is: 13255.76
## SD of sample means of size N=10 is: 3786.774
## Mean of sample means of size N=30 is: 13226.56
## SD of sample means of size N=30 is: 2193.839
## Mean of sample means of size N=60 is: 13166.96
## SD of sample means of size N=60 is: 1557.94
## Mean of sample means of size N=80 is: 13239.42
## SD of sample means of size N=80 is: 1296.041
Sampling is the selection of a subset or a statistical sample of individuals from within a statistical population to estimate characteristics of the whole population. There are many different sampling techniques including simple random sampling, systematic sampling, and stratified sampling Simple Random Sampling Without Replacement (SRSWOR) is a method of sampling where every possible sample of a certain size has an equal chance of being selected, and once a unit is selected, it is not put back into the population. Systematic sampling is a method where samples are selected via a fixed interval. Stratified sampling is when the larger group of data is broken into smaller strata and then certain sizes are picked from each group.
We are analyzing the relationship between Body Mass Index (BMI), smoking status and insurance charges. The analysis is conducted on the entire population and using three different sampling methods as stated above. The sample size of 500 is being used The scatter plots and regression lines provide insights into how the relationship varies across different sampling techniques. As seen in Scatter plot, the charges proportionately increase with increase in bmi for smokers while this is not the case with non smokers. In non smokers charges are on lower side with few outliers that could be due to dependents or region they belong to.
a1<-ggplot(health_ins, aes(x = bmi, y = charges))+
geom_point(aes(color = smoker, shape = smoker))+
geom_smooth(aes(color = smoker, fill = smoker),
method = "lm", fullrange = TRUE) +
facet_wrap(~smoker) +
stat_cor(aes(label = ..r.label..))+
scale_color_manual(values = c("#00AFBB", "#E7B800"))+
scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
theme_bw() + ggtitle("Population graph")
# Simple Random Sampling Without Replacement (SRSWOR)
set.seed(2698)
s <- srswor(500, nrow(health_ins))
sample.2 <- health_ins[s != 0, ]
b1<-ggplot(sample.2, aes(x = bmi, y = charges))+
geom_point(aes(color = smoker, shape = smoker))+
geom_smooth(aes(color = smoker, fill = smoker),
method = "lm", fullrange = TRUE) +
facet_wrap(~smoker) +
stat_cor(aes(label = ..r.label..))+
scale_color_manual(values = c("#00AFBB", "#E7B800"))+
scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
theme_bw()+ ggtitle("SRSWOR")
#Systematic sampling
N<-nrow(health_ins)
n<-500
k<-ceiling(N/n)
# random item from first group
r <- sample(k, 1)
# select every kth item
s<-seq(r, by = k, length = n)
set.seed(2698)
sample.3 <- health_ins[s, ]
b2<-ggplot(sample.3[!is.na(sample.3$age),], aes(x = bmi, y = charges))+
geom_point(aes(color = smoker, shape = smoker))+
geom_smooth(aes(color = smoker, fill = smoker),
method = "lm", fullrange = TRUE) +
facet_wrap(~smoker) +
stat_cor(aes(label = ..r.label..))+
scale_color_manual(values = c("#00AFBB", "#E7B800"))+
scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
theme_bw() + ggtitle("Systematic Sampling")
#Stratified sampling
set.seed(2698)
st.1 <- sampling::strata(health_ins, stratanames = c("region"),
size = rep(125, 4), method = "srswor",
description = TRUE)## Stratum 1
##
## Population total and number of selected units: 325 125
## Stratum 2
##
## Population total and number of selected units: 364 125
## Stratum 3
##
## Population total and number of selected units: 325 125
## Stratum 4
##
## Population total and number of selected units: 324 125
## Number of strata 4
## Total number of selected units 500
st.sample1 <- sampling::getdata(health_ins, st.1)
b3<-ggplot(st.sample1, aes(x = bmi, y = charges))+
geom_point(aes(color = smoker, shape = smoker))+
geom_smooth(aes(color = smoker, fill = smoker),
method = "lm", fullrange = TRUE) +
facet_wrap(~smoker) +
stat_cor(aes(label = ..r.label..))+
scale_color_manual(values = c("#00AFBB", "#E7B800"))+
scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
theme_bw()+ ggtitle("Stratified Sampling")
ggarrange(a1,b1,b2,b3, nrow = 2, ncol = 2)paste0("Population has percentage of smokers",round(table(health_ins$smoker)[2]/nrow(health_ins)*100,2),' %',' and SWSWOR sample has percentage of smokers: ',table(sample.2$smoker)[2]/nrow(sample.2)*100,' %')## [1] "Population has percentage of smokers20.48 % and SWSWOR sample has percentage of smokers: 24.6 %"
paste0("Population has percentage of smokers",round(table(health_ins$smoker)[2]/nrow(health_ins)*100,2),' %',' and Systematic sample has percentage of smokers: ',table(sample.3$smoker)[2]/nrow(sample.3)*100,' %')## [1] "Population has percentage of smokers20.48 % and Systematic sample has percentage of smokers: 20 %"
paste0("Population has percentage of smokers ",round(table(health_ins$smoker)[2]/nrow(health_ins)*100,2),' %',' and stratfied sample has percentage of smokers: ',table(st.sample1$smoker)[2]/nrow(st.sample1)*100,' %')## [1] "Population has percentage of smokers 20.48 % and stratfied sample has percentage of smokers: 17.6 %"
This contains information about the top two regions with the highest average charges.The dataset (health_ins) is grouped by the region and mean of the charges is calculated for each region The top two regions are selected.
top_2_region<- health_ins %>% group_by(region) %>% summarise(mean_charges=mean(charges)) %>%
arrange(-mean_charges) %>% filter(row_number()<=2)
top_2_region$region<- as.character(top_2_region$region)## Region: southeast Mean Charges: 14735.41
## Region: northeast Mean Charges: 13406.38
This analysis helps explore the relationship between age, number of dependents, and gender in the given dataset
Data set is grouped by both children and sex variables.The mean of the age variable is calculated for each combination of children and sex. The results are spread to have separate columns for male and female mean ages
sex_avg_age<-health_ins %>% group_by(children,sex) %>% summarise(mean_age=round(mean(age),1)) %>%
select(children, sex, mean_age) %>% spread(sex,mean_age)
#mean age for each children group is calculated
avg_age<- health_ins %>% group_by(children) %>% summarise(overall_age=round(mean(age),1))
#The results are joined to have a final table with mean ages for both sexes and the overall mean age
final_avg_age<- inner_join(sex_avg_age,avg_age)
final_avg_age## # A tibble: 6 × 4
## # Groups: children [6]
## children female male overall_age
## <fct> <dbl> <dbl> <dbl>
## 1 0 38.3 38.5 38.4
## 2 1 39.5 39.4 39.5
## 3 2 40.5 38.4 39.4
## 4 3 42.2 41 41.6
## 5 4 42 36.6 39
## 6 5 37 34.5 35.6
The data set explores the relationship between insurance charges and how they are affected by various attributes. There are some observations in the data set that might not be in sync with reality as the author has taken into account specific attributes for analysis. There are social and environmental factors at play which might be the reason for data being skewed in some observations.Overall, the data reflects how personal and geographic attributes can affect health insurance charges.